Abstract. As businesses increasingly rely on social networking sites to engage with their customers, it is crucial
to understand and counter reputation manipulation activities, including fraudulently boosting the number of Facebook
page likes using like farms. To this end, several fraud detection algorithms have been proposed and some
deployed by Facebook that use graph co-clustering to distinguish between genuine likes and those generated by
farm-controlled profiles. However, as we show in this paper, these tools do not work well with stealthy farms whose
users spread likes over longer timespans and like popular pages, aiming to mimic regular users. We present an 
empirical analysis of the graph-based detection tools used by Facebook and highlight their shortcomings against 
more sophisticated farms. Next, we focus on characterizing content generated by social network accounts on their
timelines, as an indicator of genuine versus fake social activity. We analyze a wide range of features extracted from
timeline posts, which we group into two main classes: lexical and non-lexical. We postulate and verify that like farm
accounts tend to re-share content often, use fewer words and a poorer vocabulary, and more often generate duplicate
comments and likes compared to normal users. We extract relevant lexical and non-lexical features and use
them to build a classifier to detect like farm accounts, achieving significantly higher accuracy, namely, at least 99%
precision and 93% recall.


1 Introduction 

Online social networks provide organizations and public figures with a range of tools to seamlessly reach out 
to, as well as broaden, their audience. Among these, Facebook pages make it easy to broadcast updates, publicize 
products and events, and get in touch with customers and fans. Facebook allows page owners to promote their pages 
via targeted advertisement. This constitutes one of the primary sources of revenue for Facebook, as its advertising 
platform is reportedly used by 2 million small businesses, out of the 40 million which have active pages [26]. 

At the same time, as the number of likes on a Facebook page is considered a measure of its popularity [9], an
ecosystem of so-called “like farms” has emerged that offers paid services to artificially inflate the number of likes
on Facebook pages. These farms often rely on networks of fake and compromised accounts, as well as incentivized
collusion networks where users are paid for actions from their account [32]. In prior work [11], we showed that
some like farms follow a naive approach, with a large number of accounts liking target pages within a short timespan,
whereas others exhibit a stealthier behavior, gradually spreading likes over longer timespans, aiming to evade fraud
detection algorithms. We found that only a handful of like farm accounts were detected by Facebook [11].

Facebook discourages users from buying fake likes, warning that they “can be harmful to your page”^ and routinely
launches clean-up campaigns to remove fake accounts, including those engaged in like farms. Aiming to counter like
farms, researchers as well as Facebook have been working on tools to detect fake likes (see Section 6). One currently
deployed tool is CopyCatch, which detects lockstep page like patterns by analyzing the social graph between users
and pages, and the times at which the edges in the graph are created [2]. Another one, SynchroTrap, relies on the
fact that malicious accounts usually perform loosely synchronized actions in a variety of social network contexts, and
can cluster malicious accounts that act similarly at around the same time for a sustained period of time [8]. The issue 
with these methods, however, is that stealthier (and more expensive) like farms can successfully circumvent them by 
spreading likes over longer timespans and liking popular pages to mimic normal users. 


† Authors contributed equally.

^See https://www.facebook.com/help/241847306001585.



As a consequence, in this paper, we set out to characterize the liking patterns of accounts associated with like farms and
systematically evaluate the effectiveness of graph co-clustering fraud detection algorithms [2,8] in correctly identifying
like farm accounts. We show that these tools incur significantly high false positive rates for stealthy farms, as their
accounts mimic normal users. Next, we investigate the use of timeline information, including lexical and non-lexical
characteristics of user posts, to improve the detection of like farm accounts. We crawl and analyze timelines of user
accounts associated with like farms as well as a baseline of normal user accounts. Our analysis of timeline information
highlights several differences in both lexical and non-lexical features of baseline and like farm users. In particular, we
find that timeline posts by like farm accounts have 43% fewer words, a more limited vocabulary, and lower readability
than normal users’ posts. Moreover, like farm posts generate significantly more comments and likes, and a large
fraction of their posts consists of non-original and often redundant “shared activity” (i.e., repeatedly sharing posts
from other users, articles, videos, and external URLs).

Based on these timeline-based features, we train three classifiers using supervised two-class support vector machines
(SVM) [21] and evaluate them using our ground-truth dataset. Our first and second classifiers use, respectively,
lexical and non-lexical features extracted from timeline posts, while the third one uses both. Our evaluation shows that
the latter can accurately detect like farm accounts, achieving up to 99-100% precision and 93-97% recall. Finally,
we generalize our approach using other state-of-the-art classifier algorithms, namely, decision tree [5], AdaBoost [13],
kNN [1], random forest [4], and naive Bayes [36], and empirically confirm that the SVM classifier achieves higher
accuracy across the board.

Paper Organization. The rest of the paper is organized as follows. The next section introduces the datasets used in
our experiments, while Section 3 evaluates the accuracy of state-of-the-art co-clustering techniques in detecting like farm
accounts in our datasets. Next, we study timeline-based features (both non-lexical and lexical) in Section 4, and evaluate
classifiers built using these features in Section 5. After reviewing related work in Section 6, the paper concludes in
Section 7.


2 Data 

Previous Campaigns. Our starting point is the set of Facebook accounts gathered as part of our prior work [11], which
presented an exploratory analysis of Facebook like farms using honeypots. Specifically, we created 13 Facebook pages
called “Virtual Electricity” and, while keeping them empty (i.e., no posts/pictures), promoted eight of them using
popular like farms and five using Facebook “page like” ads. The eight like farm campaigns employed BoostLikes.com
(BL), SocialFormula.com (SF), AuthenticLikes.com (AL), and MammothSocials.com (MS), each with one campaign
targeting worldwide users and one targeting users in the USA, while the five Facebook campaigns respectively targeted
users in the USA, France, India, Egypt, and worldwide. In the rest of the paper, we use the campaign acronyms
followed by the target audience, e.g., SF-ALL denotes the SocialFormula.com campaign targeting worldwide users.
Note that BL-ALL and MS-ALL did not actually deliver any likes, even though they were paid.

Overall, our campaigns [11] garnered 5,918 likes from 5,616 unique users: 1,437 unique accounts from Facebook 
ad campaigns and 4,179 unique accounts from the like farm campaigns. Note that some users liked more than one
honeypot page. After a few months, we checked how many accounts had been closed or terminated and found that
624/5,616 accounts (11%) were no longer active. 

New Data Collection. In August 2015, we began to crawl the pages liked by each of the 4,179 like farm users 
from [11], using the Selenium web driver.^ We also collected basic information associated with each page, such as 
the total number of likes, category, and location, using the page identifier. Unlike in our previous study, we now also 
crawled the timelines of the like farm accounts. Specifically, we collected timeline posts (up to a maximum of 500
recent posts), the comments on each post, as well as the associated number of likes and comments on each post. 

Besides some accounts having become inactive (376), we also could not crawl the timelines of 24 users who had
restricted the visibility of their timeline. Moreover, Facebook blocked all the accounts we used for crawling, so we
had to stop before completing our data collection and, hence, missed a further 109 users.
In summary, our new data consists of 3,670 users out of the initial 4,179 users (88%) gathered in [11]. We collected
more than 234K posts (messages, shared content, check-ins, etc.) for these accounts, noticing that 72% of them had at
least 10 publicly visible posts.

^http://docs.seleniumhq.org/projects/webdriver/

Campaign        #Users   #Pages Liked   #Pages Liked (Unique)    #Posts
BL-USA             583         79,025                  37,283    44,566
SF-ALL             870        879,369                 108,020    46,394
SF-USA             653        340,964                  75,404    38,999
AL-ALL             707        162,686                  46,230    61,575
AL-USA             827        441,187                 141,214    30,715
MS-USA             259        412,258                 141,262    12,280
Tot. Farms       3,899      2,315,489                 549,413   234,529
Baseline [10]    1,408         79,247                  57,384    34,903

Table 1: Overview of the datasets used in our study.

We also rely on a sample of 1,408 random accounts previously gathered by Chen et al. [10], which we use to form 
a baseline of “normal” accounts. For each of these accounts, we again collected posts from their timeline, their page 
likes, and information from these pages. 53% of the accounts had at least 10 visible posts on the timeline, and in total 
we collected about 35K posts. 

Table 1 provides a summary of the data used in this paper. Note that users who like more than one of the honeypot 
Facebook pages are included in all rows, hence the disparity between 3,670 (unique users) and 3,899. Overall, we 
gathered information from about 600K unique pages, liked by 3,670 like farm accounts and 1,408 normal accounts, 
and around 270K timeline posts. 

Ethical Considerations. Note that we collected openly available data such as (public) profile and timeline information, 
as well as page likes. Also, all data was encrypted at rest and has not been re-distributed. No personal information was 
extracted as we only analyzed aggregated statistics. We also consulted our Institutional Review Board (IRB), which 
classified our research as exempt. 


3 Limitations of Graph Co-Clustering Techniques 

Aiming to counter fraudulent activities, including like farms, Facebook has recently deployed detection tools such 
as CopyCatch [2] and SynchroTrap [8]. These tools use graph co-clustering algorithms to detect large groups of 
malicious accounts that like similar pages around the same time frame. However, as highlighted in our prior work [11], 
some stealthy like farms deliberately modify their behavior in order to avoid synchronized patterns, which might reduce 
the effectiveness of these detection tools. Specifically, while several farms use a large number of accounts (possibly 
fake or compromised) liking target pages within a short timespan, some spread likes over longer timespans and onto 
popular pages aiming to circumvent fraud detection algorithms. 

Experiments. We now set out to evaluate the effectiveness of user-page graph co-clustering algorithms. We use the labeled
dataset of 3,670 users from six different like farms and the 1,408 baseline users, and employ a graph co-clustering
algorithm to divide the user-page bipartite graph into distinct clusters [18]. Similar to CopyCatch [2] and SynchroTrap [8],
the clusters identified in the user-page bipartite graph represent near-bipartite cores, and the set of users in a
near-bipartite core like the same set of pages. Since we are interested in distinguishing between two classes of users
(like farm users and normal users), we set the target number of clusters at 2.

Results. In Table 2, we report the ROC statistics of the graph co-clustering algorithm, specifically, true positives (TP),
false positives (FP), true negatives (TN), false negatives (FN), Precision: TP/(TP + FP), Recall: TP/(TP + FN),
and F1-score, i.e., the harmonic mean of precision and recall. Figure 1 visualizes the clustering results as
user-page scatter plots. The x-axis represents the user index and the y-axis the page index.^ The vertical black line
marks the separation between two clusters. The points in the scatter plot are colored to indicate true positives (green),
true negatives (blue), false positives (red), and false negatives (black).
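
As a minimal sketch of how the reported statistics follow from the four confusion counts, the snippet below uses the standard definitions (recall is computed over TP + FN). The function name `detection_metrics` is hypothetical and not part of the original study; the input counts are copied from the BL-USA row of Table 2.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (precision, recall, f1) from confusion counts.

    tn is accepted for completeness but does not enter these three metrics.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion counts from the BL-USA row of Table 2.
precision, recall, f1 = detection_metrics(tp=523, fp=588, tn=18, fn=0)
print(f"precision={precision:.0%} recall={recall:.0%} f1={f1:.0%}")
```

For BL-USA, the large number of false positives is what pulls precision below 50% even though recall is perfect, which is the pattern discussed in the analysis of stealthy farms.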

^To ease presentation, we exclude users and pages with fewer than 10 likes.

Campaign    TP    FP    TN   FN   Precision   Recall   F1-score
AL-USA     681     9   569    4         98%      99%        99%
AL-ALL     448    53   527    1         89%      99%        94%
BL-USA     523   588    18    0         47%     100%        64%
SF-USA     428    67   512    1         86%     100%        94%
SF-ALL     431    48   530    2         90%      99%        95%
MS-USA     201    22   549    2         90%      99%        93%


Table 2: Effectiveness of the graph co-clustering algorithm. 
Fig. 1: Visualization of graph co-clustering results. The vertical black line indicates the separation between two clusters. We note
that the clustering algorithm fails to achieve good separation, leading to a large number of false positives (red dots).

Analysis. We observe two distinct behaviors in the scatter plots: (1) “liking everything” (vertical streaks), and (2) “everyone
liking a particular page” (horizontal streaks). Both like farm and normal users exhibit vertical and horizontal
streaks in the scatter plots. 

While the graph co-clustering algorithm neatly separates users for AL-USA, it incurs false positives for other like 
farms. In particular, the co-clustering algorithm fails to achieve a good separation for BL-USA, where it incurs a large 
number of false positives, resulting in 47% precision. Further analysis reveals that the horizontal false positive streaks 
in BL-USA include popular pages, such as “Fast & Furious” and “SpongeBob SquarePants,” each with millions of 
likes. We deduce that stealthy like farms, such as BL-USA, use the tactic of liking popular pages aiming to mimic 
normal users, which reduces the accuracy of the graph co-clustering algorithm. 

Our results highlight the limitations of prior graph co-clustering algorithms in detecting fake likes by like farm 
accounts. We argue that fake liking activity is challenging to detect when only relying on monitoring the liking activity 
due to the increased sophistication of stealthier like farms. Therefore, as we discuss next, we plan to leverage the 
characteristics of timeline features to improve accuracy. 

4 Characterizing Timeline Features 

We now set out to design and evaluate timeline-based detection of like farm accounts. We start by characterizing
timeline activities with respect to two categories of features, non-lexical and lexical. We do so aiming to identify the
most distinguishing features to be used by machine learning algorithms to accurately classify like farm
accounts and normal user accounts.

Types of Posts. We start by analyzing how users interact with the posts on their timeline. Figure 2 plots the distribution 
of types of posts on users’ timelines. More than 50% of those made by baseline users are text, whereas, for like farm 
users this ratio is less than 44% as they post more web links and videos. Note that “Others” include shared posts, 
Facebook actions such as ‘listening to’, ‘traveling to’, ‘feeling’, and life events like ‘in a relationship’, and ‘married’. 
We find that this category includes about 22% of posts for like farm users and about 16% of posts for baseline users. 



Fig. 2: Distribution of types of posts.


4.1 Analysis of Non-Lexical Features 

Comments and Likes. In Figure 3(a), we plot the distributions of the number of comments a post attracts, revealing
that users of the AL-ALL like farm generate many more comments than the baseline users. We note that BL-USA is almost
identical to the baseline users. Next, Figure 3(b) shows the number of likes associated with users’ posts, highlighting
that posts of like farm users attract many more likes than those of baseline users. Therefore, posts produced by the
former gather more likes (and also have lower lexical richness, as shown later on in Table 3), which might actually
indicate their attempt to mask suspicious activities.

Shared Content. We next study the distributions of posts that are classified as “shared activity,” i.e., originally made 
by another user, or articles, images, or videos linked from an external URL (e.g., a blog or YouTube). Figure 3(c) 
shows that baseline users generate more original posts, and share fewer posts or links, compared to farm users. 
Fig. 3: Distribution of non-lexical features for like farm and baseline accounts.


Words per Post. Figure 3(d) plots the distributions of the number of words that make up a text-based post, highlighting
that posts of like farm users tend to have fewer words. Roughly half of the users in four of the like farms (AL-ALL,
BL-USA, SF-ALL, and SF-USA) use 10 or fewer words in their posts, versus 17 words for baseline users.


4.2 Analysis of Lexical Features 

We now look at features that relate to the content of timeline posts. Although we only consider posts in English,
similar lexical features could be extracted for other languages, such as Chinese [37]. Readers are referred to [24]
for more details about the features used in this paper. We also considered user timelines as the collection of posts
and the corresponding comments on each post (i.e., all textual content) and built a corpus of words extracted from
the timelines by applying the term frequency-inverse document frequency (TF-IDF) statistical tool [22]. However, the
overall performance of this “bag-of-words” approach was poor, which can be explained by the short nature of the
posts. Indeed, [15] has shown that the word frequency approach to analyzing short text on social media and blogs does
not perform well. Thus, in our work, we disregard simple TF-IDF based analysis of user timelines and identify other
lexical features.

Language. Next, we analyze the ratio of posts in English, i.e., we identify the language of every post using
a standard language detection library.^ For each user, we count the number of English-language posts and calculate
its ratio with respect to the total number of posts. Figure 4 shows that the baseline users and like farm users in the USA
(i.e., MS-USA, BL-USA, and AL-USA) mostly post in English, while users of worldwide campaigns (MS-ALL,
BL-ALL, AL-ALL) have significantly fewer posts in English. For example, the median ratio of English posts for the
AL-ALL campaign is around 10% and that for SF-ALL around 15%. We acknowledge that our analysis is limited
to English-only content and may be statistically biased toward native English speakers, i.e., non-USA users. While
our analysis should be extended to other languages, we argue that English-based lexical analysis provides sufficient
differences across different categories of users. Thus, developing algorithms for language detection and processing on
non-English posts is out of the scope of this paper.
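
The per-user ratio of English posts can be sketched as follows. This is only an illustration under stated assumptions: `english_ratio` and `toy_is_english` are hypothetical names, and the toy predicate (pure-ASCII text counts as English) merely stands in for a real language detection library, which is not called here.

```python
def english_ratio(posts, is_english):
    """Fraction of a user's posts that the supplied detector labels as English.

    `is_english` is a caller-supplied predicate (an assumption of this sketch);
    in practice it would wrap a language detection library.
    """
    if not posts:
        return 0.0
    english = sum(1 for text in posts if is_english(text))
    return english / len(posts)


def toy_is_english(text):
    # Toy stand-in for a real detector: treats pure-ASCII text as English.
    return text.isascii()


timeline = ["Good morning everyone", "Bonne journée à tous", "See you soon"]
ratio = english_ratio(timeline, toy_is_english)
print(round(ratio, 2))
```

Users whose ratio is zero have no English-language posts at all, so no English-based lexical features can be computed for them.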

^https://python.org/pypi/langdetect 
Fig. 4: Distributions of the ratio of English to non-English posts.


Campaign   Avg Chars   Avg Words   Avg Sents   Avg Word Length   Avg Sent Length   Richness    ARI   Flesch Score
Baseline       4,477         780          67               6.9              17.6       0.70   20.2           55.1
BL-USA         7,356       1,330          63               5.7              22.8       0.58   16.9           51.5
AL-ALL         2,835         464          32               6.2              13.9       0.59   14.8           43.6
AL-USA         2,475         394          33               6.2              12.7       0.49   14.1           54.0
SF-ALL         1,438         227          19               6.3              11.7       0.58   14.1           45.2
SF-USA         1,637         259          22               6.3              12.0       0.55   14.4           45.6
MS-USA         6,227       1,047          66               6.1              17.8       0.53   16.2           50.1


Table 3: Lexical analysis of timeline posts. 


Readability. We further analyze posts for grammatical and semantic correctness. We parse each post to extract the
number of words, sentences, punctuation, non-letters (e.g., emoticons), and measure the lexical richness, as well as
the Automated Readability Index (ARI) [25] and Flesch score [12]. Lexical richness, defined as the ratio of the number
of unique words to the total number of words, reveals noticeable repetitions of distinct words, while the ARI, computed
as $(4.71 \times \text{average word length}) + (0.5 \times \text{average sentence length}) - 21.43$, estimates the comprehensibility of a text
corpus. Table 3 shows a summary of the results. In comparison to like farm users, baseline users post text with higher
lexical richness (70% vs. 55%), ARI (20 vs. 15), and Flesch score (55 vs. 48), thus suggesting that normal users use a
richer vocabulary and that their posts have higher readability.
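
A minimal sketch of the two measures defined above, lexical richness and ARI, is given below. The tokenization rules (a regular expression for words, splitting sentences on `.`, `!` and `?`) and the function name `lexical_stats` are assumptions made for illustration; the paper does not specify how words and sentences were delimited.

```python
import re


def lexical_stats(text):
    """Compute lexical richness and ARI for a piece of text.

    Word and sentence splitting here are simplifying assumptions.
    Returns None when the text has no words or no sentences.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return None
    # Richness: unique words (case-insensitive) over total words.
    richness = len({w.lower() for w in words}) / len(words)
    avg_word_len = sum(len(w) for w in words) / len(words)   # characters per word
    avg_sent_len = len(words) / len(sentences)                # words per sentence
    ari = 4.71 * avg_word_len + 0.5 * avg_sent_len - 21.43
    return {"richness": richness, "ari": ari}


stats = lexical_stats("I like cats. I like dogs.")
print(stats)
```

Very short texts made of short words can yield a negative ARI, as in this toy input, which is consistent with the index being designed for longer passages.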


4.3 Summary & Takeaways 

Our analysis of user timelines highlights several differences in both lexical and non-lexical features of normal and
like farm users. In particular, we find that posts made by like farm accounts have 43% fewer words, a more limited
vocabulary, and lower readability than normal users’ posts. Moreover, like farm users generate significantly more
comments and likes, and a large fraction of their posts consists of non-original and often redundant “shared activity”.

Campaign   Total Users   Training Set   Testing Set    TP   FP    TN   FN   Precision   Recall   Accuracy   F1-Score
BL-USA             583            466           117    37   12   270   80         76%      32%        77%        45%
AL-ALL             707            566           141   132    5   278    9         96%      94%        97%        95%
AL-USA             827            662           164   113    4   278   51         97%      69%        88%        81%
SF-ALL             870            696           174   139    9   273   35         94%      80%        90%        86%
SF-USA             653            522           131   110    5   277   21         96%      84%        94%        90%
MS-USA             259            207            52    39    2   280   13         95%      75%        96%        84%

Table 4: Effectiveness of non-lexical features (+SVM) in detecting like farm users.

Campaign   Total Users   Training Set   Testing Set    TP   FP    TN   FN   Precision   Recall   Accuracy   F1-Score
BL-USA             564            451           113   113    0   240    0        100%     100%       100%       100%
AL-ALL             675            540           135   129    2   238    6         98%      96%        98%        97%
AL-USA             570            456           114   113    0   240    1        100%      99%        99%        99%
SF-ALL             761            609           152   150    1   239    2         99%      99%        99%        99%
SF-USA             570            456           114    99    2   238   15         98%      87%        95%        92%
MS-USA             224            179            45    45    0   240    0        100%     100%       100%       100%

Table 5: Effectiveness of lexical features (+SVM) in detecting like farm users.

In the next section, we will use these timeline features to automatically detect like farm users using a machine
learning classifier.


5 Timeline-based Detection of Like Farms 

Aiming to automatically distinguish like farm users from normal (baseline) users, we use a supervised two-class
SVM classifier [21], implemented using scikit-learn [6] (an open source machine learning library for Python). We
later compare this classifier with other well-known supervised classifiers such as Decision Tree [5], AdaBoost [13],
kNN [1], Random Forest [4], and Naive Bayes [36] and confirm that the two-class SVM is the most effective in
detecting like farm users.

We extract four non-lexical features and twelve distinct lexical features from the timelines of baseline and like farm 
users, as explained in Section 4. The non-lexical features are the average number of words, comments, likes per post, 
and re-shares. The lexical features include: the number of characters, words, and sentences; the average word length, 
sentence length, and number of upper case letters; the average percentage of punctuation, numbers, and non-letter 
characters; richness, ARI, and Flesch Score. 
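
To make the four non-lexical features concrete, the sketch below derives them from a user's posts. The record layout (keys `text`, `comments`, `likes`, `shared`) and the function name are hypothetical, and expressing re-shares as the fraction of shared posts is an assumption of this sketch rather than the paper's stated definition.

```python
from statistics import mean


def non_lexical_features(posts):
    """Build [avg words, avg comments, avg likes, share ratio] for one user.

    Each post is a dict with hypothetical keys: 'text' (str),
    'comments' (int), 'likes' (int) and 'shared' (bool).
    """
    if not posts:
        return [0.0, 0.0, 0.0, 0.0]
    avg_words = mean(len(p["text"].split()) for p in posts)
    avg_comments = mean(p["comments"] for p in posts)
    avg_likes = mean(p["likes"] for p in posts)
    # Assumption: re-shares are summarized as the fraction of shared posts.
    share_ratio = sum(1 for p in posts if p["shared"]) / len(posts)
    return [avg_words, avg_comments, avg_likes, share_ratio]


example_posts = [
    {"text": "hello world", "comments": 2, "likes": 4, "shared": False},
    {"text": "check this out now", "comments": 0, "likes": 10, "shared": True},
]
features = non_lexical_features(example_posts)
print(features)
```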

We form two classes by labeling like farm and baseline users’ lexical and non-lexical features as positives and
negatives, respectively. We use 80% and 20% of the features to build the training and testing sets, respectively. Appropriate
values for the parameters $\gamma$ (radial basis function kernel parameter [23]) and $\nu$ (SVM parameter) are set empirically
by performing a greedy grid search, over ranges bounded above by $2^{0}$ for both parameters, on each training
group.
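
The 80%/20% partition can be sketched as below. The shuffling, the fixed seed, and the name `split_users` are assumptions for illustration, since the text does not say how users were assigned to each set; rounding the training size to the nearest integer reproduces the training set sizes listed in Table 4 (e.g., 466 of 583 BL-USA users).

```python
import random


def split_users(user_ids, train_fraction=0.8, seed=0):
    """Shuffle user ids deterministically and split them into train/test lists."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)   # seeded so the split is repeatable
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# 583 placeholder ids, matching the number of BL-USA users.
train_ids, test_ids = split_users(range(583))
print(len(train_ids), len(test_ids))
```

Splitting by user rather than by post keeps all of a user's features on one side of the partition, which avoids leaking a test user's behavior into training.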

Non-Lexical Features. Table 4 reports on the accuracy of our classifier with non-lexical features, i.e., users’ interactions
with posts as described in Section 4.1. Note that for each campaign, we train the classifier with 80% of the
non-lexical features from the baseline and campaign training sets derived from the campaign users’ timelines. The poor
classification performance for the stealthiest like farm (BL-USA) suggests that non-lexical features alone are not sufficient
to accurately detect like farm users.

Lexical Features. Next, we evaluate the accuracy of our classifier with lexical features, reported in Table 5. We filter
out all users with no English-language posts (i.e., with R=0, see Figure 4). Again, we train the classifier with 80% of the
lexical features from the baseline and like farm training sets. We observe that our classifier achieves very high precision
and recall for MS-USA, BL-USA, and AL-USA. Although the accuracy decreases by approximately 8% for SF-USA,
the overall performance suggests that lexical features are useful in automatically detecting like farm users.

Campaign   Total Users   Training Set   Testing Set    TP   FP    TN   FN   Precision   Recall   Accuracy   F1-Score
BL-USA             583            466           117   113    1   281    4         99%      97%        99%        98%
AL-ALL             707            566           141   137    1   281    4         99%      97%        99%        98%
AL-USA             827            662           164   157    1   281    7         99%      96%        98%        97%
SF-ALL             870            696           174   163    2   280   11         99%      94%        97%        96%
SF-USA             653            522           131   122    1   281    9         99%      93%        98%        96%
MS-USA             259            207            52    50    0   282    2        100%      96%        99%        98%

Table 6: Effectiveness of both lexical and non-lexical features (+SVM) in detecting like farm users.

Campaign   SVM   Decision Tree   AdaBoost   kNN   Random Forest   Naive Bayes
BL-USA     98%             96%        96%   91%             88%           53%
AL-ALL     98%             84%        95%   86%             84%           75%
AL-USA     97%             88%        90%   91%             86%           81%
SF-ALL     96%             90%        94%   89%             87%           67%
SF-USA     96%             83%        92%   79%             78%           61%
MS-USA     98%             90%        89%   89%             87%           74%

Table 7: F1-Score with different classification methods, using both lexical and non-lexical features, in detecting like farm users.


Combining Lexical and Non-Lexical Features. Approximately 3% to 22% of like farm users and 14% of baseline
users do not have English-language posts and are not considered in the classification based on lexical features. To include
them in the classification, for each like farm and the baseline, we set their lexical features to zero and aggregate the
lexical features with the non-lexical features, and evaluate our classifier with the same classification methodology as
detailed above. Results are summarized in Table 6, which shows high accuracy for all like farms (F1-Score ≥ 96%),
thus confirming the effectiveness of our timeline-based features in detecting like farm users.
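
The zero-filling step described above can be sketched as follows, assuming twelve lexical and four non-lexical features per user. The function name `combined_features` and the ordering of the concatenated vector are hypothetical choices made for this illustration.

```python
N_LEXICAL = 12  # number of lexical features extracted per user


def combined_features(non_lexical, lexical=None):
    """Concatenate lexical and non-lexical features into one vector.

    Users without English-language posts have no lexical features
    (lexical=None); their lexical slots are filled with zeros.
    """
    if lexical is None:
        lexical = [0.0] * N_LEXICAL
    if len(lexical) != N_LEXICAL:
        raise ValueError(f"expected {N_LEXICAL} lexical features")
    return list(lexical) + list(non_lexical)


# A user with no English posts: only the four non-lexical values are known.
vector = combined_features([3.0, 1.0, 7.0, 0.5])
print(len(vector), vector[:N_LEXICAL])
```

Keeping the vector length fixed at sixteen lets the same classifier be trained on users with and without English-language posts.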

Comparison With Other Machine Learning Classifiers. In order to generalize our approach, we have also used
other machine learning classification algorithms, i.e., Decision Tree, AdaBoost, kNN, Random Forest, and Naive
Bayes. The training and testing of all these classifiers follow the same set-up as the SVM approach. We again use
80% and 20% of the combined lexical and non-lexical features to build the training and testing sets, respectively. We
summarize the performance of the classifiers in Table 7. Our results show that the SVM classifier achieves the highest
F1-Scores across the board. Due to overfitting on our dataset, Random Forest and Naive Bayes show poor results
and require mechanisms such as pruning, detailed analysis of parameters, as well as selection of the optimal set of
prominent features to improve classification performance [4,19].


Analysis. We now analyze in more detail the classification performance (in terms of F1-Score) to identify the most
distinctive features. Specifically, we incrementally add lexical and non-lexical features to train and test our classifier
for all campaigns. We observe that the average word length (cf. Figure 5(a)) and average number of words per post
(cf. Figure 5(b)) provide the most improvement in the F1-Score for all campaigns. This finding suggests that like
farm users use shorter words and fewer words in their timeline posts as compared to baseline users. While
these features provide the largest improvement in detecting a like farm account, an attempt to circumvent detection by
increasing the word length or number of words per post will also affect the ARI, Flesch score, and richness. That is,
increasing word length and number of words in posts in a way that is neither readable nor understandable will not make
the account appear more real overall. Therefore, combining several features increases the workload required
for like farm accounts to appear real. The overall classification accuracy with both lexical and non-lexical features is
reported in Figure 5(c).

Fig. 5: Cumulative F1-Score for all lexical and non-lexical features measured. The x-axis shows the incremental inclusion of
features in both training and testing of SVM. Details of the classification performance for all features are listed in Table 6.


Remarks. Our results demonstrate that it is possible to accurately detect like farm users from both sophisticated and 
naive farms by incorporating additional account information - specifically, timeline activities. We also argue that the 
use of a variety of lexical and non-lexical features will make it difficult for like farm operators to circumvent detection. 
Like farms typically rely on pre-defined lists of comments, resulting in word repetition and lower lexical richness. 
As a result, we argue that, should our proposed techniques be deployed by Facebook, it will be challenging, as well 
as costly, for fraudsters to modify their behavior and evade detection, since this would require instructing automated 
scripts and/or cheap human labor to match the diversity and richness of real users’ timeline posts. 


6 Related Work 

Prior work has focused quite extensively on the analysis and the detection of fake accounts in online social
networks [3,7,14,34,35]. By contrast, we focus on detecting accounts that are employed by like farms to boost the number
of Facebook page likes, whether they are operated by a bot or a human.

Our classifier is trained using accounts obtained from honeypots, somewhat similar to previous work in the context 
of spam in My Space and Twitter [20,29]. Our work uses accounts attracted by Facebook pages actively engaging 
like farms and, unlike [20,29], leverages timeline-based features for the detection. Wang et al. [33] study the human 
involvement in Weibo’s reputation manipulation services, showing that simple evasion attacks (e.g., workers modifying 
their behavior) as well as poisoning attacks (e.g., administrators tampering with the training set) can severely affect 
the effectiveness of machine learning algorithms to detect malicious crowd-sourcing workers. Partially informed by 
their work, we not only cluster like activity performed by users but also build on lexical and non-lexical features. 

Other studies have analyzed services that sell Twitter followers [28], fake and compromised Twitter accounts [31], 
as well as crowdturfing in social networks [27]. Specific to Facebook fraud is CopyCatch [2], a technique currently 
deployed by Facebook to detect fraudulent accounts by identifying groups of connected users liking a set of pages 
within a short time frame. SynchroTrap [8] extends CopyCatch by clustering accounts that perform similar, possibly 
malicious, synchronized actions, using tunable parameters such as time-window and similarity thresholds in order to 
improve detection accuracy. However, as highlighted in our prior work [11], while some farms seem to be operated 
by bots (producing large bursts of likes and having limited numbers of friends) that do not really try to hide their 
activities, other, stealthier farms exhibit behavior that may be challenging to detect with tools like CopyCatch and 
SynchroTrap. In fact, our evaluation of graph co-clustering techniques shows that these farms successfully evade 
detection by avoiding lockstep behavior and liking sets of seemingly random pages. As a result, we use timeline 
features, both lexical and non-lexical, to build a classifier that detects stealthy like farm users with 
high accuracy. Our work can complement other methods used in prior work to detect fake and compromised accounts, 
such as using unsupervised anomaly detection techniques [32], temporal features [16,17], or IP addresses [30]. 
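To make the notion of lockstep behavior concrete, the following toy sketch flags pairs of accounts that like the same pages within a short time window. It is not the CopyCatch or SynchroTrap algorithm, both of which operate on much larger groups and graphs; the function name `lockstep_pairs`, its parameters, and the sample data are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def lockstep_pairs(likes, window, min_pages):
    """Return pairs of users who liked at least `min_pages` common pages
    within `window` time units of each other.

    `likes` is an iterable of (user, page, timestamp) tuples.
    """
    # Keep the earliest like timestamp per (page, user).
    by_page = defaultdict(dict)
    for user, page, ts in likes:
        prev = by_page[page].get(user)
        if prev is None or ts < prev:
            by_page[page][user] = ts

    # Count, for every user pair, the pages liked close together in time.
    shared = defaultdict(int)
    for user_ts in by_page.values():
        for (u, tu), (v, tv) in combinations(sorted(user_ts.items()), 2):
            if abs(tu - tv) <= window:
                shared[(u, v)] += 1

    return {pair for pair, n in shared.items() if n >= min_pages}


# Hypothetical like events: two bursty accounts, one account that
# spreads its likes over time, and one unrelated user.
likes = [
    ("bot1", "p1", 100), ("bot2", "p1", 130),
    ("bot1", "p2", 500), ("bot2", "p2", 520),
    ("stealthy", "p1", 100), ("stealthy", "p2", 90000),
    ("alice", "p1", 40000),
]

print(lockstep_pairs(likes, window=60, min_pages=2))
```

In this example only the two bursty accounts are flagged; the account that likes the same pages at widely separated times is not, which mirrors the evasion strategy of stealthier farms discussed above.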

Finally, we stress that our prior work [11] only presents an exploratory measurement study of like farms, based 
on the characteristics of the accounts that liked a few honeypot pages. Specifically, [11] analyzes the geographic and 
demographic distribution of garnered likes, the temporal patterns observed for each campaign, as well as the social 
graph induced by the likers. In this paper, by contrast, we go a significant step further: although we re-use the honeypot 
campaigns to build a corpus of like farm users, (i) we demonstrate that temporal and social graph analysis can only be 
used to detect naive farms, and (ii) we introduce a timeline-based classifier that achieves a remarkably high degree of 
accuracy. 


7 Conclusion 

The detection of fraudulent accounts in online social networks is crucial to maintain confidence among legitimate 
users and investors. In this paper, we focused on detecting accounts used by Facebook like farms, i.e., paid services 
artificially boosting the number of likes on a given Facebook page. We crawled liking patterns and timeline activities 
from like farm accounts and from a baseline of normal users. We evaluated the effectiveness of existing graph-based 
fraud detection algorithms, such as CopyCatch [2] and SynchroTrap [8], and demonstrated that sophisticated like 
farms can successfully evade detection. 

Aiming to address this problem, we set out to incorporate additional profile information from accounts’ timelines to 
train machine learning classifiers geared to distinguish like farm users from normal ones. We first experimented 
with term frequency-inverse document frequency (TF-IDF) but achieved relatively poor performance. We then 
turned to lexical and non-lexical features from user timelines. We found that posts made by like farm accounts have 
43% fewer words, a more limited vocabulary, and lower readability than normal users’ posts. Moreover, like farm posts 
generated significantly more comments and likes, and a large fraction of their posts consists of non-original and often 
redundant “shared activity” (i.e., repeatedly sharing posts made by other users, articles, videos, and external URLs). 
By leveraging both lexical and non-lexical features, we experimented with several machine learning classifiers, with 
the best of our classifiers (SVM) achieving as high as 100% precision and 97% recall, and at least 99% and 93% 
respectively across all campaigns - significantly higher than graph co-clustering techniques. 

In theory, fraudsters could try to modify their behavior in order to evade our proposed timeline-based detection. 
However, like farms rely either on heavily automated mechanisms or on manual input from cheap human labor. Therefore, 
since non-lexical features are extracted from users’ interactions with timeline posts, imitating normal users’ behaviors 
will likely incur a remarkably higher cost. Even higher would be the cost to interfere with lexical features, since this 
would entail modifying or imitating normal users’ writing style. 